For this week’s lab, we’ll take a look at one of the data sets we introduce in PSTAT 131/231 – one of the most popular practice data sets on Kaggle, the Titanic data. Although we do fit some models to the Titanic data in 131/231, the feature engineering we do in that course is relatively limited. If we delve more deeply, we can potentially achieve better model performance.
Data
As a reminder, the Titanic dataset comes from an ongoing Kaggle competition in which many data scientists practice fitting models to hone their skills. You can find a link to the competition here. It is a binary classification problem, where the goal is to predict which passengers would survive the Titanic shipwreck.
The following code loads the data from data/titanic.csv. You can familiarize yourself with the variables it contains using the codebook (data/titanic_codebook.txt) or with the information below.
# A tibble: 6 × 12
passenger_id survived pclass name sex age sib_sp parch ticket fare cabin
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1 No 3 Brau… male 22 1 0 A/5 2… 7.25 <NA>
2 2 Yes 1 Cumi… fema… 38 1 0 PC 17… 71.3 C85
3 3 Yes 3 Heik… fema… 26 0 0 STON/… 7.92 <NA>
4 4 Yes 1 Futr… fema… 35 1 0 113803 53.1 C123
5 5 No 3 Alle… male 35 0 0 373450 8.05 <NA>
6 6 No 3 Mora… male NA 0 0 330877 8.46 <NA>
# ℹ 1 more variable: embarked <chr>
Code
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
import string
import warnings
warnings.filterwarnings('ignore')

titanic = r.titanic
There are \(891\) rows and \(12\) columns in the dataset. The columns are:
passenger_id: Unique identifier of rows/passengers. Not likely to have any effect on the outcome;
survived: The target variable, a binary variable with values "Yes" (passenger survived) or "No" (passenger did not survive);
pclass: Socioeconomic status of the passenger, an ordinal feature with three values, 1 = upper class, 2 = middle class, 3 = lower class;
name: Passenger’s full name, last name first;
sex: Passenger sex, a binary variable with values "male" and "female";
age: Passenger age in years;
sib_sp: Total number of the passenger’s siblings and/or spouses on board;
parch: Total number of the passenger’s parents and/or children on board;
ticket: Passenger’s ticket number;
fare: Passenger’s fare;
cabin: Passenger’s cabin number;
embarked: Point of embarkation, a categorical feature with three values, "C" = Cherbourg, "Q" = Queenstown, "S" = Southampton.
Missing Data
To assess the amount and pattern of missingness in R, we could run:
Code
library(naniar)
Warning: package 'naniar' was built under R version 4.3.2
Code
vis_miss(titanic)
This would reveal to us that there are missing values of age and cabin. Without further investigation, we might not realize that there are also missing values of embarked, but there are:
Code
titanic$embarked %>%
  is.na() %>%
  mean()
[1] 0.002244669
In Python, we could diagnose the number and pattern of missing values:
Code
def display_missing(df):
    for col in df.columns.tolist():
        print('{} column missing values: {}'.format(col, df[col].isnull().sum()))
    print('\n')

display_missing(titanic)
We will replace missing values of age with median imputation, but rather than using the overall median, we’ll replace any missing values with the corresponding median of age for that level of passenger class and sex. In other words, if a first-class male passenger is missing their age value, we will replace their missing value with the median age of all other first-class male passengers, and so on.
The reasoning behind this is passenger class’s moderate correlation with both age and survived (and because we can logically reason that passenger sex, which also has a strong relationship with survival, might be related as well). To check the correlations:
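In Python, we could check the correlation between age and passenger class and then carry out the grouped median imputation. A minimal sketch, using a toy stand-in DataFrame (in the lab itself, this would be applied to `titanic`):

```python
import pandas as pd

# Toy stand-in for the Titanic data; in the lab this would be `titanic`
df = pd.DataFrame({
    'pclass': [1, 1, 3, 3, 3, 3],
    'sex': ['male', 'male', 'male', 'male', 'female', 'female'],
    'age': [40.0, None, 22.0, 24.0, 26.0, None],
})

# Check the correlation between age and passenger class
print(df[['age', 'pclass']].corr())

# Replace each missing age with the median age of that (pclass, sex) group
df['age'] = df.groupby(['pclass', 'sex'])['age'] \
              .transform(lambda s: s.fillna(s.median()))
```

`groupby(...).transform(...)` returns a series aligned with the original rows, so the grouped medians slot directly back into the `age` column.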
Let’s look at the two passengers who are missing values for the embarked variable:
Code
titanic %>%
  filter(is.na(embarked))
# A tibble: 2 × 12
pclass sex passenger_id survived name age sib_sp parch ticket fare cabin
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1 fema… 62 Yes Icar… 38 0 0 113572 80 B28
2 1 fema… 830 Yes Ston… 62 0 0 113572 80 B28
# ℹ 1 more variable: embarked <chr>
They were traveling together, since they shared the same cabin number and ticket number. Googling one of their names tells us that Mrs. Stone “boarded the Titanic in Southampton on 10 April 1912 and was travelling in first class with her maid Amelie Icard. She occupied cabin B-28.” We can then simply replace those two missing values of embarked with "S" for Southampton.
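In Python, that replacement is a one-liner; here is a sketch on a toy column with two missing values, like the full data:

```python
import pandas as pd

# Toy embarkation column with two missing values
titanic = pd.DataFrame({'embarked': ['C', None, 'Q', None]})

# Both passengers with missing values boarded at Southampton
titanic['embarked'] = titanic['embarked'].fillna('S')
```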
The cabin feature is a little bit tricky and needs further exploration. As we saw previously, about \(77\%\) of passengers are missing information about the cabin in which they resided.
We could simply drop these rows, but there is useful information in the cabin variable that we might want to take advantage of if possible; the letter at the start of the cabin number tells us which deck the cabin was located on.
As deck goes from A to G, distance to the staircase increases, which may well have been a factor in survival; it would be useful to retain that information.
We’ll label missing values of cabin as "M" and treat them as a separate deck.
titanic['deck'] = titanic['cabin'].apply(
    lambda x: re.search(r"([A-Z]+)", x).group(0) if pd.notna(x) else "M"
)
If we then create a percent stacked bar chart to look at the distribution of passenger classes across decks:
Code
titanic %>%
  ggplot(aes(x = deck, fill = factor(pclass))) +
  geom_bar(position = "fill")
In Python:
Code
plt.figure(figsize=(8, 6))
sns.histplot(
    data=titanic,
    x='deck',
    hue='pclass',
    multiple='fill',
    shrink=0.8,
    palette="Set2"
)
plt.xlabel("Deck")
plt.ylabel("Proportion")
plt.title("Proportion of Passengers by Deck and Pclass")
plt.show()
(Note that Python doesn’t automatically put the levels of the deck variable in alphabetical order.)
We can see the following:
\(100\%\) of A, B, C, and T decks are first-class passengers;
Deck D is about \(87\%\) first-class and \(13\%\) second-class passengers;
Deck E is about \(83\%\) first-class, \(10\%\) second-class, and \(7\%\) third-class passengers;
Deck F is about \(62\%\) second-class and \(38\%\) third-class passengers;
\(100\%\) of Deck G is third-class passengers;
passengers missing cabin information are mainly second- and third-class passengers.
We might then want to look at the count of observations in each of these categories, to make sure there are no categories with only a very few observations:
Code
titanic %>%
  group_by(deck) %>%
  count()
# A tibble: 9 × 2
# Groups: deck [9]
deck n
<chr> <int>
1 A 15
2 B 47
3 C 59
4 D 33
5 E 32
6 F 13
7 G 4
8 M 687
9 T 1
There is only one person in the data set in T-deck, and since they are a first-class passenger (and therefore more similar to A-, B-, and C-deck), we’ll group them with A-deck.
We’ll combine A, B, and C decks together, D and E decks together, and F and G decks together. This is because their distributions of passenger classes are relatively similar, and the number of passengers in these decks is relatively low, which might make working with the individual deck membership difficult later on.
We will also now drop cabin because we’ve retained the useful information from it.
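A Python sketch of this regrouping, assuming the deck variable created earlier (the group labels ABC, DE, FG, and M are illustrative choices):

```python
import pandas as pd

# Toy frame with one passenger per original deck level
titanic = pd.DataFrame({
    'deck': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'M', 'T'],
    'cabin': ['A6', 'B28', 'C85', 'D10', 'E31', 'F2', 'G6', None, 'T1'],
})

# The single T-deck passenger is first-class, so group them with A/B/C
deck_map = {'A': 'ABC', 'B': 'ABC', 'C': 'ABC', 'T': 'ABC',
            'D': 'DE', 'E': 'DE', 'F': 'FG', 'G': 'FG', 'M': 'M'}
titanic['deck'] = titanic['deck'].map(deck_map)

# deck now carries the useful information from cabin, so drop cabin
titanic = titanic.drop(columns=['cabin'])
```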
If we preferred to use Python, we could split the data as shown (notice that we are stratifying on the outcome variable, since we know from previous experience that it is imbalanced):
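A sketch of such a stratified split with scikit-learn, on a toy frame (the 30% test proportion and the random seed are illustrative choices):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the engineered Titanic data
titanic = pd.DataFrame({
    'survived': ['Yes', 'No'] * 50,
    'fare': range(100),
})

# Stratifying on the outcome keeps its class proportions in both splits
titanic_train, titanic_test = train_test_split(
    titanic, test_size=0.3, stratify=titanic['survived'], random_state=1
)
```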
For each of the continuous features, we’ll look at their distribution with a histogram and a density curve, broken down by level of the outcome, survived.
Fare
Here are both those plots for the distribution of fare, first in R, then in Python:
Code
titanic_train %>%
  ggplot(aes(x = fare, fill = survived)) +
  geom_histogram(alpha = 0.3)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Code
titanic_train %>%
  ggplot(aes(x = fare, fill = survived)) +
  geom_density(alpha = 0.3)
Code
titanic_train['survived'] = titanic_train['survived'].astype('category')

# Histogram plot
plt.figure(figsize=(10, 5))
sns.histplot(
    data=titanic_train,
    x='fare',
    hue='survived',
    element='step',
    stat='density',
    common_norm=False,
    alpha=0.3,
    palette='Set1'
)
plt.title("Fare Distribution by Survival (Histogram)")
plt.xlabel("Fare")
plt.ylabel("Density")
plt.show()
Code
# Density plot
plt.figure(figsize=(10, 5))
sns.kdeplot(
    data=titanic_train,
    x='fare',
    hue='survived',
    fill=True,
    alpha=0.3,
    palette='Set1'
)
plt.title("Fare Distribution by Survival (Density Plot)")
plt.xlabel("Fare")
plt.ylabel("Density")
plt.show()
The fare feature is positively skewed, and survival rates are very high on the high end. We’ll divide this feature into 13 quantile-based bins. Using quantiles to create the bins will ensure that we have the same number of observations in each group. Creating bins like this does reduce the variability of the feature, but may help models like elastic net pick up on relevant places to split the features. (In practice, it’s worth trying both ways and comparing the cross-validation error for each.)
Here is the distribution of the binned variable, broken down by levels of the outcome:
titanic_train['fare_binned'] = pd.qcut(titanic_train['fare'], q=13, labels=False) + 1
titanic_train['fare_binned'] = titanic_train['fare_binned'].astype(str)

plt.figure(figsize=(10, 6))
sns.histplot(
    data=titanic_train,
    x='fare_binned',
    hue='survived',
    multiple='fill',
    shrink=0.8,
    palette='Set1'
)
plt.xlabel("Fare Binned (13 Quantiles)")
plt.ylabel("Proportion")
plt.title("Proportion of Survival by Fare Bins")
plt.show()
Age
Here are both those plots for the distribution of age, first in R, then in Python:
Code
titanic_train %>%
  ggplot(aes(x = age, fill = survived)) +
  geom_histogram(alpha = 0.3)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Code
titanic_train %>%
  ggplot(aes(x = age, fill = survived)) +
  geom_density(alpha = 0.3)
Code
plt.figure(figsize=(10, 5))
sns.histplot(
    data=titanic_train,
    x='age',
    hue='survived',
    element='step',
    stat='density',
    common_norm=False,
    alpha=0.3,
    palette='Set1'
)
plt.title("Age Distribution by Survival (Histogram)")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()
Code
# Density plot
plt.figure(figsize=(10, 5))
sns.kdeplot(
    data=titanic_train,
    x='age',
    hue='survived',
    fill=True,
    alpha=0.3,
    palette='Set1'
)
plt.title("Age Distribution by Survival (Density Plot)")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()
This variable is not really normally distributed, but it also isn’t extremely skewed. We could attempt to apply a transformation to bring the distribution closer to normal, or we could leave it as is. Another option would be to bin this variable. We’ll choose to leave it as is.
Categorical Features
Family Size
Let’s consolidate the information from the two variables sib_sp and parch into one, creating a new variable that we’ll call family_size. We can sum sib_sp and parch, then add one to represent the individual themselves. After doing so, looking at a bar chart reveals that there are a number of family sizes with very few observations, so it may make more sense to re-code the variable into Alone, Small, Medium, and Large, as shown below.
titanic_train['family_size'] = titanic_train['sib_sp'] + titanic_train['parch'] + 1

# Categorize family size
def categorize_family_size(size):
    if size == 1:
        return "Alone"
    elif 2 <= size <= 3:
        return "Small"
    elif 4 <= size <= 5:
        return "Medium"
    else:
        return "Large"

titanic_train['family_size'] = titanic_train['family_size'].apply(categorize_family_size)

# Plot the bar chart
plt.figure(figsize=(10, 6))
sns.histplot(
    data=titanic_train,
    x='family_size',
    hue='survived',
    multiple='fill',
    shrink=0.8,
    palette='Set1'
)
plt.xlabel("Family Size")
plt.ylabel("Proportion")
plt.title("Proportion of Survival by Family Size")
plt.show()
Interestingly, members of large families were less likely to survive, as were passengers traveling with no family members. Members of medium and small families were more likely to survive. It may be that the bigger the family, the more likely the family had children, and children were prioritized for rescue – however, that doesn’t explain the drop in survival chances for members of large families.
Ticket Size
It might seem redundant to make another feature using the information from ticket, but this variable may capture additional information. Not all passengers on the Titanic were traveling with family members. Many, like the example in the “missing data” section, traveled with servants or staff, who were listed on the same ticket with them but not counted as family members.
Here we create a variable called ticket_size and look at the relationship between it and survived.
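A Python sketch of this step, mirroring the test-set version of the same code that appears later in the lab:

```python
import pandas as pd

# Toy data: two passengers share a ticket, two travel alone
titanic_train = pd.DataFrame({
    'ticket': ['113572', '113572', '347082', 'A/5 21171'],
    'survived': ['Yes', 'Yes', 'No', 'No'],
})

# Count passengers per ticket, then attach that count to each passenger
ticket_sizes = titanic_train.groupby('ticket').size().reset_index(name='ticket_size')
titanic_train = titanic_train.merge(ticket_sizes, on='ticket')
titanic_train['ticket_size'] = titanic_train['ticket_size'].astype('category')
```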
We can see that the largest number of passengers on the same ticket was \(6\); the smallest was \(1\). Interestingly, passengers in a group of 3 were most likely to survive, while those in a group of 5 were least likely.
Title
There is very likely to be some useful information in the name variable, although it is trickier to access.
Below, we split the name variable into pieces at the comma, which separates the last name from first, middle, and title. All passengers have a title at the beginning, with a period following it. We can use regular expressions, as seen below, to extract the titles for each passenger. We then use fct_lump to combine the less common titles (with fewer than 4 passengers) together into an "Other" category and look at the relationship between passenger title and survival.
titanic_train['title'] = titanic_train['name'].apply(
    lambda x: re.search(r",\s*(.*?)(\.|$)", x).group(1) if re.search(r",\s*(.*?)(\.|$)", x) else None
)
title_counts = titanic_train['title'].value_counts()
top_titles = title_counts.nlargest(4).index
titanic_train['title'] = titanic_train['title'].where(titanic_train['title'].isin(top_titles), 'Other')
titanic_train['title'] = titanic_train['title'].astype('category')

plt.figure(figsize=(10, 6))
sns.histplot(
    data=titanic_train,
    y='title',
    hue='survived',
    multiple='fill',
    shrink=0.8,
    palette='Set1'
)
plt.ylabel("Title")
plt.xlabel("Proportion")
plt.title("Proportion of Survival by Title")
plt.show()
Older men (with titles of Mr. rather than Master) were the least likely to survive, while married women (with titles of Mrs.) were the most likely.
Finalized Dataset
In R, we don’t need to manually dummy-code (or one-hot encode) the categorical features or transform the outcome variable, because tidymodels provides us with recipe steps we can use.
In Python, however, we need to dummy-code the features and re-code the target ourselves. Here I provide the code to do this.
We should remember to apply all of the feature engineering done above to both the training and the testing sets.
# Process the training dataset
titanic_train = titanic_train.drop(columns=['ticket', 'passenger_id', 'name', 'sib_sp', 'parch', 'fare'])
titanic_train['survived'] = titanic_train['survived'].astype('category')
titanic_train['survived'] = titanic_train['survived'].map({'Yes': 1, 'No': 0})

# Process the test dataset
titanic_test['fare_binned'] = pd.qcut(titanic_test['fare'], q=13, labels=False) + 1
titanic_test['family_size'] = titanic_test['sib_sp'] + titanic_test['parch'] + 1

def categorize_family_size(size):
    if size == 1:
        return "Alone"
    elif 2 <= size <= 3:
        return "Small"
    elif 4 <= size <= 5:
        return "Medium"
    else:
        return "Large"

titanic_test['family_size'] = titanic_test['family_size'].apply(categorize_family_size)

ticket_sizes = titanic_test.groupby('ticket').size().reset_index(name='ticket_size')
titanic_test = titanic_test.merge(ticket_sizes, on='ticket')
titanic_test['ticket_size'] = titanic_test['ticket_size'].astype('category')

titanic_test['title'] = titanic_test['name'].apply(
    lambda x: re.search(r",\s*(.*?)(\.|$)", x).group(1) if re.search(r",\s*(.*?)(\.|$)", x) else None
)
title_counts = titanic_test['title'].value_counts()
top_titles = title_counts.nlargest(4).index
titanic_test['title'] = titanic_test['title'].where(titanic_test['title'].isin(top_titles), 'Other')
titanic_test['title'] = titanic_test['title'].astype('category')

titanic_test = titanic_test.drop(columns=['ticket', 'passenger_id', 'name', 'sib_sp', 'parch', 'fare'])

# One-hot encode categorical features for training and test sets
titanic_train = pd.get_dummies(titanic_train, drop_first=False)
titanic_test = pd.get_dummies(titanic_test, drop_first=False)

# Align columns of training and testing data to ensure they have the same feature set
titanic_train, titanic_test = titanic_train.align(titanic_test, join='outer', axis=1, fill_value=0)

# Show final datasets
print(titanic_train.head())
Notice that we forgot to center and scale the continuous features. Try implementing those transformations on both the training and testing sets.
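One way to do this with scikit-learn is to fit StandardScaler on the training set only and then apply it to both sets; a sketch on a toy column (which columns count as continuous depends on your final feature set):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy continuous column standing in for the real features
titanic_train = pd.DataFrame({'age': [22.0, 38.0, 26.0, 35.0]})
titanic_test = pd.DataFrame({'age': [30.0, 47.0]})

# Fit on the training set only, so no test-set information leaks
# into the transformation
num_cols = ['age']
scaler = StandardScaler()
titanic_train[num_cols] = scaler.fit_transform(titanic_train[num_cols])
titanic_test[num_cols] = scaler.transform(titanic_test[num_cols])
```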
Since deck is clearly related to pclass, it might make sense that we could impute the missing values of deck using information from pclass. Try imputing the missing values via KNN.
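One possible sketch uses scikit-learn’s KNNImputer. Since it only handles numeric input, we can temporarily encode deck as ordered codes, impute from pclass and fare, then round and decode; the encoding, the predictors used, and the neighbor count here are all illustrative choices, not the only reasonable ones:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: deck is known for some passengers and missing for others
titanic = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 1],
    'fare':   [80.0, 71.3, 13.0, 7.9, 8.1, 52.0],
    'deck':   ['ABC', 'ABC', 'DE', 'FG', None, None],
})

# KNNImputer needs numbers, so encode the grouped deck levels as codes
deck_levels = ['ABC', 'DE', 'FG']
titanic['deck_code'] = titanic['deck'].map(
    {lvl: i for i, lvl in enumerate(deck_levels)}
)

# Fill each missing deck code from the 2 nearest neighbors on pclass/fare
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(titanic[['pclass', 'fare', 'deck_code']])

# Round back to the nearest known level and decode
titanic['deck'] = [deck_levels[int(round(c))] for c in imputed[:, 2]]
```

Note that rounding the averaged neighbor codes back to a category is a rough heuristic; a classifier trained to predict deck from the other features would be a more principled alternative.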
After one-hot encoding, we have considerably increased the size of the dataset. One way we can approach solving this problem is to remove any variables whose variance is near zero. In Python, code to accomplish this is:
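One option is scikit-learn’s VarianceThreshold; a sketch on a toy one-hot frame (the threshold value and column names are illustrative choices):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy one-hot frame: 'rare_level' is nearly constant after encoding
titanic_train = pd.DataFrame({
    'deck_ABC':   [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    'rare_level': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Drop features whose (population) variance falls below the threshold
selector = VarianceThreshold(threshold=0.1)
selector.fit(titanic_train)
titanic_train = titanic_train[titanic_train.columns[selector.get_support()]]
```

In practice we would fit the selector on the training set and keep the same columns in the testing set, so the two remain aligned.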